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Abstract 

The inverse temperature parameter of the Potts model governs the 
strength of spatial cohesion and therefore has a substantial influence over 
the resulting model fit. A difficulty arises from the dependence of an in¬ 
tractable normalising constant on the value of this parameter and thus 
there is no closed form solution for sampling from the posterior distri¬ 
bution directly. There are a variety of computational approaches for 
sampling from the posterior without evaluating the normalising constant. 

These algorithms differ in their levels of accuracy and their scalability 
for datasets of realistic size. This study compares pseudolikelihood, the 
exchange algorithm, path sampling, and approximate Bayesian computa¬ 
tion. We assess their accuracy and scalability using synthetic data as well 
as 2D remote sensing and 3D medical images. Our findings provide guid¬ 
ance on selecting a suitable algorithm for Bayesian image analysis. For 
nontrivial images, this necessarily involves some degree of approximation 
to produce an acceptable compromise between accuracy and computa¬ 
tional cost. 
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1 Introduction 


Markov random field (MRF) models have seen widespread use in image analysis 
since their introduction by Besag [1974], as surveyed by Winkler [2003] and Li 
[2009]. A MRF is a generalisation of the Markovian dependence structure to 
more than one dimension: satellite imagery has two spatial dimensions, while 
medical images such as computed tomography (CT) are three-dimensional. The 
hidden Potts [1952] model employs a latent MRF on discrete states to describe 
spatial dependence between adjacent neighbours. The degree of dependence in 
the model is governed by a parameter known as the inverse temperature due to 
its origin in statistical physics. It is difficult to set this parameter by trial and 
error, particularly for noisy images. Rather than using a fixed value, it would be 
preferable to estimate the inverse temperature as part of the model. However, 
the intractable normalising constant of the Potts model depends on the value 
of the inverse temperature, which means that there is no closed form solution 
for estimating its posterior distribution. 

Ryden and Titterington [1998] derived a pseudolikelihood (PL) approxima¬ 
tion [Besag, 1975] to the intractable posterior density. Gelrnan and Meng [1998] 
instead approximated the ratio of normalising constants using thermodynamic 
integration (TI), also known as path sampling. Mpller et al. [2006] introduced 
an auxiliary variable method that gives an exact MCMC algorithm for the spe¬ 
cial case of a 2 component Potts model, also known as an Ising model. Murray 
et al. [2006] proposed a variant of the exact method known as the exchange 
algorithm or multiple auxiliary variable method (MAVM). By replacing the ex¬ 
pensive perfect sampling step [Propp and Wilson, 1996] with an approximation 
such as Gibbs sampling, Cucala et al. [2009] developed an approximate exchange 
algorithm that can be applied for Potts models with k > 2. Friel et al. [2009] in¬ 
troduced the reduced dependence approximation (RDA) which uses recursion to 
calculate the normalising constant on small sub-lattices. McGrory et al. [2012] 
generalised RDA to an irregular lattice. Grelaud et al. [2009] used the sufficient 
statistic of the Potts model to estimate the inverse temperature using approx¬ 
imate Bayesian computation (ABC). Everitt [2012] combined the approximate 
exchange algorithm with particle Markov chain Monte Carlo (PMCMC) and 
also implemented ABC with sequential Monte Carlo (SMC-ABC) for the Ising 
model. Although all of these methods work very well in theory, the largest 
image used in any of these papers to demonstrate an algorithm is less than ten 
thousand pixels. This does not give a reliable indication of how these algorithms 
might perform when applied to images of a more substantial size. 

Images containing multiple megapixels are now commonplace, from digital 
photography and high-definition (HD) video to medical imaging and remote 
sensing. Table 1 illustrates how the number of pixels translates to real world 
scale for various types of images, including the area covered by a satellite image, 
the number of axial slices of a CT scan, and the number of frames of HD 
video. Multispectral images from the Landsat 7 [NASA, 2011] and Landsat 8 
[USGS, 2014] satellites have a spatial resolution of 30 metres per pixel (area 
of 900m 2 ). A Landsat image covers an area of approximately 170km north- 
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Number 
of pixels 

Satellite 

(900m 2 /px) 

CT slices 
(512x512) 

HD Video 

1080p 

2 6 

0.06km 2 



5 6 

14.06km 2 

0.1 


10 6 

900.00km 2 

3.8 

0.5 

15 6 

10251.56km 2 

43.5 

5.5 


Table 1: Scale of common types of images. 


south by 183km east-west. Tomographic reconstructions such as CT scans are 
usually represented as 3D image stacks, with 512 x 512 pixels per slice. The 
pixel resolution and slice width varies depending on the clinical protocol. A 
single frame of HD video is typically 1920 x 1080 pixels with 24 frames per 
second (fps), although higher frame rates and resolutions (such as ultra high 
definition, UHD) are available. The volumes of data involved in video processing 
necessitate specialised methods, which are beyond the scope of this paper. For 
examples of Bayesian methods for video analysis, see Simoncelli [1999], Minvielle 
et al. [2010], and the references therein. The remainder of the discussion will 
focus on static images. 

This is the first study to systematically investigate the scalability of methods 
for intractable likelihoods. In this paper we compare the performance of pseudo¬ 
likelihood (PL), the approximate exchange algorithm (MAVM), thermodynamic 
integration (TI), and approximate Bayesian computation (ABC). By conducting 
these experiments we are able to address questions such as how the computa¬ 
tional cost increases with the size of the images, as well as how much accuracy is 
lost by using faster, more approximate methods. The imaging datasets, which 
include 2D remote sensing and 3D medical imaging, are described in Section 2. 
We define the hidden Potts model in Section 3 and examine the properties of 
its intractable normalising constant. The various computational methods for 
estimating the inverse temperature are reviewed in Section 4. We present our 
experimental results in Section 5 and conclude the article with a discussion. 


2 Example datasets 

2.1 Synthetic data 

The inverse temperature cannot be directly observed, so the only way to mea¬ 
sure the accuracy of a method is to use simulated data. We have adapted a 
method that was proposed by Cook et al. [2006] for the purpose of assessing 
the correctness of software for fitting a Bayesian model. Their central concept 
is to draw values of the model parameters from the prior and simulate data 
using those parameters, then fit the model and compare the estimate of the 
posterior distribution to the known value. We do not make use of the y 2 test 
devised by Cook et al., since our goal is not to evaluate the hypothesis that our 
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implementation of the algorithm is correct. We know that all four algorithms 
are approximations, therefore our aim is to compare the performance of these 
algorithms under various conditions. 

We use a uniform prior to draw values of the inverse temperature. This prior 
is adopted because it reflects current practice and enables comparison between 
our results and other recent studies. We use informative, conjugate priors for 
the means and variances of the mixture components in our simulation study. 
This avoids the problem of label switching and corresponds with the model that 
we use in our applications to real data. We use a fixed number of 3 mixture 
components to generate 6 datasets containing 20 images each. The images 
within a dataset all have the same dimensions. The small images have 2 6 pixels 
(8x8 for 2D images, or 4 x 4 x 4 for 3D). The medium images have 5 6 (125 x 125, 
or 25 x 25 x 25). The large images have 10 6 (1000 x 1000, or 100 x 100 x 100). 
This enables us to study how the performance of the algorithms changes as the 
size of the images increases, as well as how well these algorithms generalise to 
a 3D lattice. 


2.2 Satellite remote sensing 

Landsat imagery of the Greater Brisbane region was converted to the Normalised 
Difference Vegetation Index (NDVI). This index is calculated using the visible 
red (RED) and near-infrared ( NIR ) bands: 


NDVI = 


NIR - RED 
NIR + RED 


( 1 ) 


NDVI is generally used to detect vegetated areas, with an expected value 
for vegetation being between 0.3 and 0.8. In this analysis, we are interested 
in detecting clusters which relate to vegetated areas. The size of the data set 
does not enable us to individually view the satellite image of Brisbane, however, 
as indicated in Figure lb, a mixture model may be an appropriate means of 
determining clusters, and attributes of clusters, specifically those related to 
vegetation. In addition, a further goal of data analysis is detecting loss (or 
gain) of moderate sized vegetative areas through time. A mixture model may 
also suit this task, as pixel movement between clusters may indicate a change 
in habitat or land use. 


2.3 Medical images 

Computed tomography (CT) is a technology for reconstructing a three-dimensional 
image by filtered back-projection from X-ray beams [Kalender, 2011]. The pixel 
intensities in a CT scan are linearly related to the electron density (ED) within 
the object that is imaged. CT scanners are calibrated in Hounsfield units (HU) 
with water at 0 HU and air at -1024 HU. 

Rather than using biological images, which are difficult for the untrained 
eye to interpret, in this paper we have used 27 CT scans of an ED phantom 
(CIRS, Inc. model 062). This object is manufactured from epoxy and contains 
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(a) Satellite image of Brisbane, Australia. 



-0.4 -0.2 0.0 0.2 0.4 0.6 0.8 

Pixel Intensity 


(b) Histogram of NDVI values. 

Figure 1: A satellite image and the distribution of pixel intensities. 
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(a) CT scan of the ED phantom. 



-1000 0 1000 2000 3000 

Hounsfield units 


(b) Histogram of CT numbers (HU). 

Figure 2: An axial slice of a CT scan and the distribution of pixel intensities. 
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cylindrical inserts that mimic the X-ray absorption of human tissue. Since 
the geometry and density of this object is known, the CT scans can be used to 
explore the effect of inverse temperature on segmentation accuracy. An example 
slice of a CT scan of the ED phantom is illustrated in Figure 2. The cylindrical 
inserts are arranged in pairs, clockwise from the top: lung (exhale); adipose 
(fatty tissue); dense (cortical) bone; muscle; lung (inhale); breast (50% fat); 
spongy (trabecular) bone; and liver. The body of the phantom is composed of 
water-equivalent solid and therefore should have a mean pixel intensity of 0 HU. 


3 Hidden Potts model 


Image segmentation can be viewed as the task of labelling the observed pixels 
y according to a finite set of discrete states z £ {1, ..., k}. The hidden Potts 
model allows for spatial correlation between neighbouring labels in the form of 
a Markov random field. The latent labels follow a Gibbs distribution, which is 
specified in terms of its conditional probabilities: 


p{zi\z\i,/3) 


exp {PJ2i~efi( z i, z e)} 

Ej=i ex P 


( 2 ) 


where j3 is the inverse temperature, z\ t represents all of the labels except z,, 
i ~ £ are the neighbouring pixels of i, and S(u , v) is the Kronecker delta function. 
Thus, z l) is a count of the neighbours that share the same label. 

If the labels z t are indexed row-wise, the nearest (first-order) neighbours i ~ l 
in a regular 2D lattice with c columns are {zj_i, Zj_ c , Zj+ C , Zj+i}. Pixels situated 
at the boundary of the image domain have less than four neighbours. Likewise, 
voxels in a regular 3D lattice have a maximum of 6 first-order neighbours. These 
neighbourhood relationships are reciprocal, so h £ * ~ (. implies i £ h ~ i. If 
£ is the set of all unique neighbour pairs, or edges in the image lattice, then 
\£\ = 2 (n — \/n) for a square lattice and 3(n — n 2 / 3 ) for a cube. 

The observation equation links the latent labels to the corresponding pixel 
values: 

n 

p{y\z,0) = Y[p{yi\zi,9 Zi ) (3) 

i=1 

where 0j are the parameters that govern the distribution of the pixel values with 
label j. The hidden Potts model can thus be viewed as a spatially-correlated 
generalisation of the finite mixture model [Ryden and Titterington, 1998]. Green 
and Richardson [2002] used a Poisson likelihood for (3), with intensity A j. In¬ 
stead we follow Geman and Geman [1984], Alston et al. [2007], and many others 
in assuming that the pixels with label j share a common mean corrupted by 
additive Gaussian noise with variance <r 2 : 


Vi\zi =3, Pj ~ -A7 (nj , cr 2 ) 


(4) 
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The Gibbs distribution is a member of the exponential family and so there 
is a sufficient statistic for this model, as noted by Grelaud et al. [2009]: 

S(z) = Y S ( z i’ z e) ( 5 ) 

i~e&£ 

This statistic represents the total number of like neighbour pairs in the image. 
The likelihood p(y,z\8,/3) can therefore be factorised into p(y|z, 8)p(S(z)\/3), 
where the second factor does not depend on the observed data, but only on the 
sufficient statistic. The joint posterior is then: 


p{0, P, z|y) oc p(y|z, 0)n(e)p(S(z)\P)n(P) (6) 

The conditional distributions p(0 |z,y) and p(zi\z\i, P,yi,8 Zi ) can be simulated 
using Gibbs sampling, but p(/3|y,z,0) involves an intractable normalising con¬ 
stant C(p): 


p{P I y,z ,0) 


OC 

oc 


p(S(z)|/3)tt(/3) 

exp{/3S(z)} _ r( ^ 


(7) 

( 8 ) 


The normalising constant is also known as a partition function in statistical 
physics. It has computational complexity of 0(nk n ), since it involves a sum 
over all possible combinations of the labels z £ Z\ 


C(P) = Y ex P(^ S ( z )} 

zG-2 


(9) 


It is infeasible to calculate this value exactly for nontrivial images, thus compu¬ 
tational approximations are required. 

The conditional expectation of S(z) given /? can be expressed in terms of the 
normalising constant: 

Ez| /3 [ s( z)] = ^ 1 ° g{ C ( /3 )} (10) 

As P approaches infinity, all of the pixels in the image are almost surely assigned 
the same label, thus the expectation of S(z) approaches the total number of 
edges |£| asymptotically, while the variance approaches zero. When ,3 = 0, (2) 


simplifies to exp{0}^ 


-l 


hence the probability of any pair of neighbours 
being assigned the same label follows an independent Bernoulli distribution with 
p = k~ x . In this case, S(z) follows a Binomial distribution with expectation 
|£|/fc and variance |£|fc -1 (l — A: -1 ). In a finite image lattice the distribution of 
S(z) changes smoothly between these two extremes, as illustrated by Figure 3, 
but its computation is intractable for nontrivial images. 

The Potts model undergoes a phase transition at the critical value of ft, 
switching from a disordered to an ordered state. Potts [1952] showed that the 
critical value for a 2D lattice on a cylinder can be calculated exactly according 
to: 


Per it = log 11 + Vk 


( 11 ) 




(a) Expectation for n = 12 and increas- (b) Expectation for k = 3 and increas¬ 
ing values of A: € {2,3,4}. ing values of n £ {4, 6, 9,12}. 




P P 


(c) Standard deviation for n = 12 and (d) Standard deviation for k = 3 and 

increasing values of k £ {2, 3, 4}. increasing values of n £ {4,6, 9,12}. 

Figure 3: Distribution of the sufficient statistic of the Potts model for increasing 
values of the inverse temperature /?, the number of pixels n, and the number of 
unique labels k. The expectation and standard deviation of S(z) were calculated 
exactly, using a brute force method. 
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The periodic boundary condition does not apply to the lattices considered in 
this paper, so for example the critical value for the images in Figure 3 is different 
to (11). However, the error introduced by the finite boundary diminishes as n 
increases. Figure 4 shows that (11) is very accurate in predicting the behaviour 
of S(z) for a 2D image with a maximum value of S(z) for first-order neighbours of 
1,954,672. This corresponds to the satellite image described in Section 2.2, with 
k=6 mixture components and P cr it ~ 1.24. S(z) is approximated by simulation 
using the algorithm of Swendsen and Wang [1987]. Figure 4a shows that the 
gradient of the expectation becomes very steep near the critical region, which is 
reflected in the super-exponential increase of the standard deviation illustrated 
by Figure 4c. As n —oo the derivative of the likelihood at the critical point, 
and hence the variance, is unbounded [Pickard, 1987]. This heteroskedasticity 
has important implications for many of the methods discussed in Section 4. 

There is no exact formula for p cr n in 3D images, although Hajdukovic [1983] 
developed an empirical approximation that works reasonably well: 

Pcrit ~ | log | ^ (V2 + V4k - 2^ | (12) 

The behaviour of IE z |/3 [S(z)] for a 3D image with k = 9 and P crit « 0.86 is 
illustrated by Figure 4b. This is typical of the CT scans described in Section 2.3, 
with |£| approximately equal to three million. The standard deviation is shown 
in Figure 4d. It is evident that (12) has underestimated P cr n since the maximum 
value of the standard deviation occurs at p = 0.9. 

4 Computational methods 

In this section we describe the four algorithms that are evaluated in Section 5. 
These methods provide alternative means to simulate parameter values from 
(8) without computing the intractable normalising constant. We describe the 
algorithms in terms of Markov chain Monte Carlo (MCMC) to enable direct 
comparison, although we also mention other approaches where applicable, such 
as particle-based (SMC and PMCMC) methods. Reference implementations of 
all of these methods are available from various sources described below, but 
for the purpose of comparison we have reimplemented the algorithms using 
RcppArmadillo [Eddelbuettel and Sanderson, 2014]. An R source package con¬ 
taining our code is available in Moores and Mengersen [2015]. 

4.1 Pseudolikelihood and composite likelihood 

Pseudolikelihood is the simplest of the methods that we have considered and 
also the fastest. Ryden and Titterington [1998] showed that the intractable 
distribution (8) could be approximated using the product of the conditional 
densities given by (2). This enables updates for the inverse temperature at 
iteration t to be simulated using a Metropolis-Hastings (M-H) step, as shown 
in Algorithm 1. 
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(a) Expectation for a 2D Potts model (b) Expectation for a 3D Potts model 
with k = 6. with k = 9. 







0.0 0.5 1.0 1.5 2.0 

P 


(c) Standard deviation for a 2D Potts 
model with k = 6. 



n 


(d) Standard deviation for a 3D Potts 
model with k = 9. 


Figure 4: Approximate distribution of S(z) using Swendsen-Wang for a 2D 
image with k = 6 and \£\ ~ 2 x 10 6 (left) and a 3D image with k = 9 and 
\£| ~ 3 x 10 b (right). The vertical line is the critical value of /3. 
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Algorithm 1 MCMC with pseudolikelihood (PL) 

l: Initialise /3o, /x 0 , c^o, zo 

2: for 1 £ 1... T do 

3: Update the labels z t ~ p{y i \z i ,ii Zi ,a 2 Zi )p{z i \z\ <i ,P) \/i G {1,... ,n} 

4: Calculate sufficient statistics Vj,s 2 Mzi = j, Mj £ {1,..., k} 

5: Update the noise parameters pLj, a 2 ~ p(yj, s 2 \p,j , a 2 ) n(nj\a 2 ) i t(ctJ) 

6: Draw proposed parameter value /3' ~ q(P'\fit-i) 

7: Approximate p(z|/3') and p(z\/3 t -i) using: 


p(z|/3) « \\p{zi\z\i,p) 

i=i 


8 

9 

10 

11 

12 

13 

14 

15 


Calculate the M-H ratio p = min ^1, 

Draw u ~ Uniform[0,1] 

if u < p then 

Pt^F 

else 


p( z |A-i)i r (/3t-i)9(/3'|/3t-i) 


) 


Pt A-i 

end if 
end for 


(13) 


The M-H proposal density q((3'\/3 t -i) can be any distribution such that 
/ q(j3’\l3t-i) dj3’ = 1. However, there is a tradeoff between exploring the full 
state space and ensuring that the probability of acceptance (at line 10) is suffi¬ 
ciently high. We use the adaptive random walk (RWMH) algorithm of Garth- 
waite et al. [2010], which automatically tunes the bandwidth of the proposal 
density to target a given M-H acceptance rate. When a symmetric proposal 
density is used, q(/3'\f3 t -i) = and so this term cancels out in the M-H 

ratio on line 8. The natural logarithm of p is used in practice to improve the 
numerical stability of Algorithm 1. 

Pseudolikelihood is exact when /3 = 0 and provides a reasonable approxima¬ 
tion for small values of the inverse temperature. However, the approximation 
error increases rapidly for /3 > /3 cr it, as illustrated by Figure 5. This is due 
to long-range dependence between the labels, which is inadequately modelled 
by the local approximation. The implications of this inaccuracy for posterior 
inference will be demonstrated in Section 5. 

Ryden and Titterington [1998] referred to Equation (13) in Algorithm 1 
as point pseudolikelihood, since the conditional distributions are computed for 
each pixel individually. They suggested that the accuracy could be improved 
using block pseudolikelihood. This is where the likelihood is calculated exactly 
for small blocks of pixels, then (13) is modified to be the product of the blocks: 

n b 

P^\p) ~ Y[p( z Bi\ z \Bi,P) (14) 

2=1 
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p 


(a) Expectation. 



P 


(b) Standard deviation. 

Figure 5: Approximation error of pseudolikelihood for n = 12, k = 
3 in comparison to the exact likelihood calculated using a brute force 
method: (a) S(z)p(S(z)|/3) using either Equation (8) or (13); (b) 

\/Ezgz ( S ( z ) “ Ez|/3[S(z)]) 2 p(S(z)|/3) 


13 











where Nb is the number of blocks, zg, are the labels of the pixels in block 
Bi, and z \B t are all of the labels except for zg r This is a form of composite 
likelihood, where the likelihood function is approximated as a product of sim¬ 
plified factors [Varin et ah, 2011]. Friel [2012] compared point pseudolikelihood 
to composite likelihood with blocks of 3 x 3, 4 x 4, 5 x 5, and 6x6 pixels. Friel 
showed that (14) outperformed (13) for the Ising (k = 2) model with /? < /3 cr it- 
Okabayashi et al. [2011] discuss composite likelihood for the Potts model with 
k > 2 and have provided an open source implementation in the R package potts 
[Geyer and Johnson, 2014]. 

Evaluating the conditional likelihood in (14) involves the normalising con¬ 
stant for z B t , which is a sum over all of the possible configurations Zb. . This is 
a limiting factor on the size of blocks that can be used. The brute force method 
that was used to compute Figure 3 and 5 is too computationally intensive for 
this purpose. Pettitt et al. [2003] showed that the normalising constant can be 
calculated exactly for a cylindrical lattice by computing eigenvalues of a k r x k r 
matrix, where r is the smaller of the number of rows or columns. The value 
of (9) for a free boundary lattice can then be approximated using path sam¬ 
pling. Friel and Pettitt [2004] extended this method to larger lattices using a 
composite likelihood approach. 

The reduced dependence approximation (RDA) is another form of composite 
likelihood. Reeves and Pettitt [2004] introduced a recursive algorithm to cal¬ 
culate the normalising constant using a lag-r representation. Friel et al. [2009] 
divided the image lattice into sub-lattices of size r± < r, then approximated the 
normalising constant of the full lattice using RDA: 


m 


Cr^n{P) r - ri + 1 

C ri -l X n(P) r - ri 


(15) 


McGrory et al. [2009] compared RDA to pseudolikelihood and the exact method 
of Mpllcr et al. [2006], reporting similar computational cost to pseudolikelihood 
but with improved accuracy in estimating /3. Source code for RDA is available 
in the online supplementary material for McGrory et al. [2012], 


4.2 Thermodynamic integration 

Gelman and Meng [1998] derived an approximation to the log ratio of normal¬ 
ising constants using the path sampling identity: 

log { C ^= i y } = I*' 1 Ezi/3[S(z)] d/3 (16) 

which follows from the definition of E z | j g[S(z)] in (10). The value of this definite 
integral can be approximated by simulating from the Gibbs distribution for 
fixed values of f3 and then interpolating between them. We used the Swendsen- 
Wang algorithm for simulating from z|/3, as mentioned in section 3. Figure 6a 
illustrates linear interpolation of S(z) on a 2D lattice for k = 6 and (3 ranging 
from 0 to 2 in increments of 0.05. This approximation can be precomputed 
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using the same method that was used to produce Figure 4. Figure 6b illustrates 
interpolation of S(z) on a 3D lattice for k = 9. 


Algorithm 2 Path sampling (TI) 

1: Initialise /3o, Mo; ^oi z o 

2: for f £ 1 ... T do 

3: Update the labels Zi ~ p(yi\zi, p Zi , a 2 .) p{zi\z\ i , j3) V* £ {1,..., n} 

4: Calculate sufficient statistics S(z) and Vj,s 2 \/zi = j, Vj £ {1,..., k} 

5: Update the noise parameters Pj , a 2 ~ p{yj, s 2 \pj , a 2 ) n(pj\a 2 ) 7r(crj) 

6: Draw proposed parameter value (3' ~ q(/3'|/3 t _i) 

7: Estimate E z |^[S(z)] for /3 £ {/3',/3t_i} by interpolation 

8: Evaluate the definite integral in Equation (16) 

9: Calculate the log M-H acceptance ratio: 

Tr(P')q(Pt-i\P') 

7r(/3 t -i)g(/3'|A-i) 
(17) 

10: Draw u ~ Uniform[0,1] 

ll: if u < p then 

12: fi t £- P' 

13: else 

14: 0 1 £- A_! 

15: end if 

16: end for 


log{p} = min ^0, log | | + (/3' - A-i)S(z) + log | 


Path sampling is explained in further detail by Chen et al. [2000, chap. 5]. 
A reference implementation in R is available from the website accompanying 
Marin and Robert [2007]. This algorithm has an advantage over auxiliary vari¬ 
able methods such as the exchange algorithm or ABC because the additional 
simulations are performed prior to fitting the model, rather than at each itera¬ 
tion. This is particularly the case when analysing multiple images that all have 
approximately the same dimensions, as in the examples of section 5. However, 
the computational cost is still slightly higher than pseudolikelihood, which does 
not require a pre-computation step. 

4.3 Pseudo-marginal methods 

Mpller et al. [2006] demonstrated that it is possible to simulate from the poste¬ 
rior distribution of P by introducing an auxiliary variable, so that C(p) cancels 
out in the M-H ratio. The exchange algorithm of Murray et al. [2006] improves 
the performance of this method and avoids the need for a fixed estimate of p. 
The drawback of this approach is that it requires perfect sampling from the 
stationary distribution of the Potts model. This is possible using coupling from 
the past [Propp and Wilson, 1996, Huber, 2003] but can be computationally 
prohibitive, particularly for large images. McGrory et al. [2009] reported that 
the time required for perfect sampling increased sharply for larger values of p. 
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(a) 2D Potts model with k = 6. 



(b) 3D Potts model with k = 9. 

Figure 6: Approximation of E z | j g[S(z)] by simulation for fixed values of /?, with 
linear interpolation. 
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For this reason, Cucala et al. [2009] substitute 500 iterations of Gibbs sampling 
on the auxiliary variable to produce an approximate sample from its stationary 
distribution. The details of this method are shown in Algorithm 3. 


Algorithm 3 Approximate exchange algorithm (MAVM) 
l: Initialise ft), /x 0 , er]], z 0 
2: for 1 £ 1... T do 

3: Update the labels Zi ~ p Zi , <r z .) p(zi\z\i, j3) Vi £ {1,..., n} 

4: Calculate sufficient statistics S(z) and yj,s | \/zi = j, Vj £ {1 

5: Update the noise parameters pj, <r| ~ p{yj, Sj\Pj, u|) n(pj\aj) 7r(crJ) 

6: Draw proposed parameter value ft ~ q(/3' |ft_i) 

7: Generate w|ft by sampling from Equation (2) 

8: Calculate the M-H acceptance ratio according to Equation (8): 

= . f, g(ft-iIftM/M exp |/? / S(z)} C(ft_i) exp {ft_iS(w)}C(ft) \ 

p mm ^ , g ( j g/| / 3 t _ 1 ) 7r (^ t _ 1 ) exp {^_ 1 s( z )}C(/3') exp (ftS(w)}C(ft_i)/ 

( 18 ) 

9: Draw u ~ Uniform[0,1] 

10: if u < p then 

11: fit <- P' 

12: else 

13: ft ft-i 

14: end if 

15: end for 


The M-H ratio given by Equation (18) can be considerably simplified if a 
uniform prior is used for j3 and ft is drawn from a symmetric M-H proposal 
density. In this case, log{p} simplifies to (ft — ft_i)S(z) + (ft_i — ft)S(w). 
The similarity with Equation (17) shows that the exchange algorithm is closely 
related to path sampling, since it is a form of importance sampling. An imple¬ 
mentation of the exchange algorithm is available in the online supplementary 
material accompanying Friel and Pettitt [2011]. Everitt [2012] provides source 
code for the approximate exchange algorithm with particle MCMC. 

4.4 Approximate Bayesian computation 

Like the exchange algorithm, ABC uses an auxiliary variable w to decide whether 
to accept or reject the proposed value of ft. Instead of a M-H ratio such as 
(18), the summary statistics of the auxiliary variable and the observed data 
are directly compared. The proposal is accepted if the distance between these 
summary statistics is within the ABC tolerance, e. This produces the following 
approximation when the labels z are observed without error: 

M/3 I z) « 7r e (ft | ||S(w) - S(z)|| < e) (19) 

where || • || is a suitable norm. In this paper we simply use the absolute difference 
between S(w) and S(z), since the summary statistic (5) is univariate. S(z) is 
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sufficient for f3, as noted by Grelaud et al. [2009], therefore the ABC approxi¬ 
mation (19) approaches the true posterior as n — > oo and e —>■ 0. In practice 
there is a tradeoff between the number of parameter values that are accepted 
and the size of the ABC tolerance. 

Everitt [2012] introduced ABC for noisy data, such as the Ising model, but 
only when the observations are discrete and so S(y) is defined. Moores et al. 
[2015] showed that ABC can also be applied when the domain of the observed 
data is continuous, such as with the additive Gaussian noise of Equation (4). In 
this type of model the posterior distribution of f3 does not depend directly on 
the observed data y, as shown by Equation (7). The ABC approximation (19) 
for the hidden Potts model becomes: 

p(P I y,z,0) ~ tt £ (/3 I ||S(w) - S(z)|| < e) (20) 

The other parameters can be updated using the current value of (3 and then the 
summary statistic S(z) can be computed from the current values of the labels, 
using an ABC within Gibbs approach as shown in Algorithm 4. Equation (20) 
can be considered as a moving target, since S(z) could change whenever the 
labels are updated. As a consequence, ABC with noisy data can be more prone 
to getting stuck in low probability regions of the parameter space. 

Grelaud et al. [2009] used a rejection sampler, where the proposals are drawn 
independently from the prior and thus q((3'\l3t-i) = Tr(/3). Under a sparse or 
uninformative prior, such as the uniform prior used in this study, too many 
proposals are rejected for this approach to be viable. Instead we have based 
Algorithm 4 on the ABC-MCMC algorithm of Marjoram et al. [2003]. This 
form of ABC algorithm is best suited for direct comparison with the other 
intractable likelihood methods in this article. 


Algorithm 4 ABC-MCMC with noisy data 
1: Initialise /3 0 , /4 0 , <Tq, z 0 
2: for I £ 1... T do 

3: Update the labels Zi ~ y Zi , <r z .) p(zi\z\i, /3)Vi £ {1,..., n} 

4: Calculate sufficient statistics S(z) and yj,s | \/zi = j, Vj £ {1 

5: Update the noise parameters fij, <r| ~ p{yj, s 2 \pj, cr|) 7r(/Xjjcr|) 7r(crj) 

6: Draw proposed parameter value (3' ~ q(/3'\(3t-i) 

7: Generate w|/3 7 by sampling from Equation (2) 

8: Draw u ~ Uniform[0,1] 

9: if u < and ||S(w) - S(z)|| < e then 

10 : /3 t 

11: else 

12: (31 Pt-1 

13: end if 

14: end for 


There have been many extensions to ABC, as reviewed by Marin et al. 
[2012]. Of particular interest are adaptive ABC algorithms that automatically 
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adjust the tolerance e and the proposal bandwidth ABC with sequen¬ 

tial Monte Carlo (SMC-ABC) algorithms use a sequence of target distributions 
7r Ct (/3 | ||S(w) — S(z)|| < e t ) such that ei > £2 > • • • > £t, where the number of 
SMC iterations T can be determined dynamically using a stopping rule. The 
SMC-ABC algorithm of Drovandi and Pettitt [2011] uses multiple MCMC steps 
for each SMC iteration, while the algorithm of Del Moral et al. [2012] uses mul¬ 
tiple replicates of the summary statistics for each particle. Everitt [2012] has 
provided a MATLAB implementation of SMC-ABC with the online supplemen¬ 
tary material accompanying his paper. 

The computational efficiency of ABC is dominated by the cost of drawing 
updates to the auxiliary variable, as reported by Everitt [2012]. Thus, we would 
expect that the execution time for ABC would be similar to the approximate 
exchange algorithm. Various approaches to improving this runtime have recently 
been proposed, such as the Gaussian process emulation of Wilkinson [2014], the 
“lazy ABC” of Prangle [2014], and methods involving auxiliary models [Cabras 
et ah, 2014, Moores et al., 2015, Buzbas and Rosenberg, 2015]. 

5 Experimental results 

5.1 Simulation study 

We generated the synthetic data using the R package PottsUtils [Feng, 2008, 
Feng and Tierney, 2014] as explained in Section 2.1. We used informative priors 
nim) ~ 7V(-0.15,0.05 2 ), 7 r(/x 2 ) ~ A7(0.05,0.05 2 ), 7 r(^ 3 ) - A/"(0.25, 0.05 2 ), and 
7 r((7|) ~ 1.5, 0.01) for the j £ 1... 3 mixture components and a uniform prior 

on the interval [0,1.3/3 cr jt] for the inverse temperature. For 2D images, (3 cr it was 
calculated using Equation (11) to be equal to 1.005. We used Equation (12) to 
obtain an approximate value for /3 cr n in 3D images of 0.552. 

All four algorithms were run for 10,000 iterations on each image, discarding 
the first 5,000 as burn-in. The differences in elapsed runtime are illustrated by 
Figure 7. All of the algorithms scaled linearly with increasing numbers of pixels, 
but the approximate exchange algorithm (MAVM) and ABC were roughly two 
orders of magnitude slower than either pseudolikelihood (PL) or path sampling 
(TI). For 2D images with n = 10 6 pixels the average runtime was 43 hours 
for MAVM, 42.5 hours for ABC, 36 minutes for PL and 29 minutes for TI. It 
took 30 minutes for precomputation of 64 values of E z | | g[S(z)] for path sampling, 
using 16 parallel threads. For 3D images with n = 10 6 voxels the averages were 
45 hours for MAVM or ABC, 35 minutes for PL and 30 minutes for TI. The 
precomputation step took 24 minutes for 44 values of /?. Even allowing for the 
cost of precomputation, TI was still much faster than MAVM or ABC. 

The distributions of the differences between the MCMC samples of (3 and 
the true value of the inverse temperature are illustrated by Figure 8 for 2D 
images with n = 10 b . Results for the other simulation studies are available in 
the supplementary material. It is evident that all four algorithms had difficulty 
with one of the images that had a very small value of /3 = 0.032. This appears to 
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(a) 2D images, k = 3. 



(b) 3D images, k = 3. 

Figure 7: Relationship between elapsed runtime and image size for pseudolikeli¬ 
hood (PL), the approximate exchange algorithm (MAVM), path sampling (TI), 
and approximate Bayesian computation (ABC). Both axes are on a logarithmic 
scale. 
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(d) ABC-MCMC. 


Figure 8: MCMC sampling error for /3 in simulated data for 20 images with 2 
dimensions, k = 3 and n = 10 6 . The diagonal, dashed line indicates the true 
value of (3, while the vertical, dotted line marks the critical point. 
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Method 

95% HPD interval for /3 

Iterations 

Elapsed 

CPU time 

Pseudolikelihood 

[2.260; 2.309] 

5K+55K 

2.0 h 

15.5ft 

Path sampling 

[1.248; 1.250] 

5K+55IC 

1.4ft 

10.4ft 

Exchange algorithm 

[1.275; 1.278] 

5K+5K 

50.8ft 

399.5ft 

ABC-MCMC 

[1.316; 1.323] 

5K+5K 

50.6ft 

398.2ft 


Table 2: Results for the satellite image of Brisbane, Australia. 


be due to the level of noise in the image. The algorithms produced an accurate 
estimate for another image with /3 = 0.033 and much lower levels of noise. We 
did not observe this behaviour with any of the 3D images, nor with the smaller 
2D images. We did, however, observe the errors increasing for values of /? above 
the critical point, indicated by the vertical line in Figure 8. In the worst case, PL 
underestimated j3 by 0.61, with a 95% highest posterior density (HPD) interval 
of [0.653; 0.706] when the true value of /? was 1.289. MAVM underestimated 
/3 for the same image by 0.215 and TI was out by 0.245. ABC was the most 
accurate for that image, with an error of less than 0.01. Nevertheless it produced 
erroneous estimates for other images, in one case underestimating /3 by 0.107 
and overestimating for another image by 0.150. This simulation study has shown 
that all four algorithms are prone to getting stuck in low-probability regions of 
the parameter space. 

5.2 Satellite remote sensing 

For the satellite image described in Section 2.2 we used weakly informative priors 
7r(/Xj) ~ A f(y, 5) and 7r(cr|) ~ IQ{ 1,0.01) for the k = 6 mixture components and 
7 t(/3) ~ U (0, 3) for the inverse temperature. We were unable to evaluate accuracy 
for this image because the true values of the inverse temperature and the pixel 
labels are unknown. However, we can still compare the 95% HPD intervals for 
/3 obtained from the four algorithms, as well as the time taken to fit the model. 
These results are summarised in Table 2. 

We ran pseudolikelihood and path sampling for a total of 60,000 iterations 
each, discarding the first 5,000 as burn-in. Due to the high computational 
cost of ABC-MCMC and the approximate exchange algorithm, we only ran 
these algorithms for 10,000 iterations with 5,000 burn-in, using 500 auxiliary 
iterations of Gibbs sampling. This took more than 50 hours, in comparison to 
2 hours for pseudolikelihood and 1.4 hours for path sampling, plus 33 minutes 
to precompute the expectation of the sufficient statistic for 64 values of /3. 
Pseudolikelihood was the fastest algorithm but it appears to have overestimated 
the value of j3 by a wide margin. There was no overlap between any of the 
credible intervals, but the other three algorithms all produced estimates in the 
neighbourhood of 1.3. This is higher than the critical value of /3, which is 1.24 
for a 2D image with k = 6. For comparison, we also analysed this image using 
PyMCMC [Strickland et ah, 2012] and the R package potts [Geyer and Johnson, 
2014]. PyMCMC produced a posterior estimate of 1.27, while the maximum 


22 





pseudolikelihood estimate (MPLE) returned by potts was 1.76. 

5.3 Medical images 

As explained in Section 2.3, the voxel intensity values in a CT scan are mea¬ 
sured in Hounsheld units, with water at 0 HU and air at -1024 HU. We used 
informative priors for the noise parameters 7r (fij) ~ Nirrij, 200 2 ) with mj £ 
{-800, -600, -400, -200, 0, 200,400,600, 800} and 7r(cr J) ~ TQ( 5,5 x 10" 4 ) for 
the k = 9 mixture components and 7r(/3) ~ IA{ 0,3) for the inverse temperature. 
The critical value of (3 for 3D images with k = 9 is approximately 0.858, as 
shown in Figure 4b. 

We ran pseudolikelihood and path sampling for 60,000 iterations on each 
of the 28 CT scans, but the approximate exchange algorithm and ABC were 
only run for 10,000, as we did for the satellite image and the synthetic data. 
The first 5,000 iterations were discarded as burn-in. The distributions of the 
MCMC samples for /? are illustrated in Figure 9a. The pooled 95% HPD in¬ 
tervals were [2.207; 2.281] for pseudolikelihood, [0.934; 1.019] for the exchange 
algorithm, [0.988; 0.992] for path sampling, and [0.938; 1.034] for ABC. Pseudo- 
likelihood has overestimated f3 in comparison to the other algorithms, as with 
the satellite image. The pooled credible intervals for the other three algorithms 
overlap with each other. 

The distribution of runtimes are shown in Figure 9b. It took an average 
of 106.6 hours for 10,000 iterations of the exchange algorithm and 115 hours 
for ABC-MCMC. The average runtime for 60,000 iterations of pseudolikelihood 
was 4.9 hours, while the average for path sampling was 3.5 hours. It took 38 
minutes to precompute 44 values of E z | ( g[S(z)] for path sampling. 


6 Concluding remarks 

We have compared the approximate exchange algorithm, path sampling, pseu¬ 
dolikelihood, and ABC using a simulation study as well as real data. All of 
these Markov chain Monte Carlo algorithms were prone to becoming stuck in 
low-probability regions of the parameter space. Path sampling provided the 
best tradeoff between accuracy and computational cost over all of the images 
that we tested. This was due to precomputation of E z | J g[S(z)] for fixed values of 
/3, an operation that is highly parallelisable for modern computer architectures. 
The output of this precomputation step can be reused for multiple images with 
similar dimensions. The only drawback of path sampling is that it fails to ac¬ 
count for the heteroskedasticity in the distribution of S(z). This can result in 
errors in the posterior distribution, particularly when (3 is above the critical 
value. Moores et al. [2015] introduced a precomputation step for ABC, with a 
second-order approximation that models the variance at the critical point. This 
produced a comparable runtime to path sampling, but without the outliers that 
are evident in Figure 8. 
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(a) Pooled posterior estimate for (5. 



(b) Elapsed time (hours). 

Figure 9: Results for 28 CT scans of the ED phantom obtained using pseudo¬ 
likelihood (PL), the approximate exchange algorithm (MAVM), path sampling 
(TI), and approximate Bayesian computation (ABC-MCMC). 
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We have shown that simulating auxiliary variables at every iteration is too 
computationally expensive for applications in image analysis. Exact inference is 
infeasible for the scale of data that is commonly encountered in scientific stud¬ 
ies, such as medical imaging and remote sensing. This has also been reported 
in previous studies, such as McGrory et al. [2009], Everitt [2012], Moores et al. 
[2015] and Moores and Mengersen [2014], but this is the first systematic com¬ 
parison of the scalability of these algorithms. Methods such as Prangle [2014] 
can reduce the number of auxiliary iterations that need to be performed, but 
not enough to compensate for the two orders of magnitude difference in runtime 
that we observe. 

Pseudolikelihood (PL) was the fastest of the four algorithms but produced 
very different estimates of (3 when applied to real data. The other three al¬ 
gorithms all depend on the value of the sufficient statistic S(z), while PL uses 
the product of the conditional densities. We have illustrated how the approx¬ 
imation error of PL increases for values of [3 above the critical point. This is 
due to long-range dependence between the latent labels. Composite likelihood 
methods such as Friel and Pettitt [2004] and Friel et al. [2009] can reduce the 
approximation error by computing the estimate using sub-lattices. It is possible 
that the real images that we used in this study have a more complex correlation 
structure than what can be captured by the Potts model. This could explain 
the differences in the PL estimates between the simulation study and the real 
data. 

This paper can serve as a guide to the selection of a suitable algorithm 
for Bayesian image analysis. We have provided our implementation of the four 
algorithms as an R package so that readers can experiment using their own data. 
There are faster algorithms for fitting a hidden Potts model, such as variational 
Bayes (VB) [McGrory et al., 2009] and iterated conditional modes (ICM) [Besag, 
1986], but these provide limited options for estimating the inverse temperature. 
We have shown that it is feasible to use MCMC for images of realistic size, given 
reasonable computational power by contemporary standards. This enables the 
posterior distribution of the inverse temperature to be estimated along with all 
of the other model parameters. 


SUPPLEMENTARY MATERIAL 

R package bayesImageS: R source package containing code to perform im¬ 
age segmentation using Markov chain Monte Carlo. Includes C++ imple¬ 
mentations of all of the methods described in the article. (GNU zipped 
tar file) 

Simulation study: Additional figures and tables of results for the simulation 
study of Section 5.1. (PDF) 
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